
Conversation

@yf225 (Contributor) commented Aug 18, 2025

As discussed in https://fb.workplace.com/groups/257735836456307/posts/967097558853461/?comment_id=971504168412800&reply_comment_id=971545435075340, the normal triton.testing.do_bench measurement includes torch.compile Dynamo invocation overhead and doesn't reflect real-world model use cases, where Dynamo overhead is usually hidden.

This PR adds an option to use the Inductor benchmarker as the timing measurement tool, which uses CUDA events for timing and thus more accurately measures only the CUDA kernel runtime.

Usage: --latency-measure-mode inductor_benchmarker
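
For context, here is a minimal sketch of how the two measurements could be compared on the same compiled callable. The `torch._inductor.runtime.benchmarking` import path and the toy workload are assumptions for illustration, not part of this PR:

```python
import torch
import triton.testing

# Assumed import path for the Inductor benchmarker in recent PyTorch;
# it may differ across versions.
from torch._inductor.runtime.benchmarking import benchmarker

x = torch.randn(4096, 4096, device="cuda")
compiled_relu = torch.compile(torch.relu)
compiled_relu(x)  # warm up so compilation itself is not measured

# Per the PR description, this path also pays the torch.compile Dynamo
# invocation overhead on every call.
do_bench_ms = triton.testing.do_bench(lambda: compiled_relu(x))

# Inductor benchmarker: CUDA-event-based timing of the same callable.
benchmark_gpu_ms = benchmarker.benchmark_gpu(lambda: compiled_relu(x))

print(f"do_bench: {do_bench_ms:.4f} ms, benchmark_gpu: {benchmark_gpu_ms:.4f} ms")
```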

@FindHao (Member) commented Aug 18, 2025

Besides that, I recall @xuzhao9 mentioned somewhere that we can use Kineto's trace to sum only GPU time.

@FindHao (Member) commented Aug 18, 2025

I'm not sure about this change. Will this benchmark function change the results for pure handwritten Triton kernels? If it matches the results of triton.testing.do_bench for other Triton kernels, I feel we should just use this new function. If not, we need to understand what the differences are and see how to fix them.

Adding a new latency_measure_mode doesn't sound good because the results will diverge.

what do you think? @xuzhao9

@FindHao (Member) commented Aug 18, 2025

> which uses CUDA events for timing and thus more accurately measures only the CUDA kernel runtime.

https://github.com/triton-lang/triton/blob/37f265932b68868021b2fade6354b44e613dc124/python/triton/testing.py#L163
do_bench also uses CUDA events for timing. I may need to double-check the implementation of benchmarker.benchmark_gpu.

@BoyuanFeng commented Aug 18, 2025

@FindHao benchmarking.benchmark_gpu seems to be better for benchmarks than triton.testing.do_bench. Please check this script for comparison: P1908752761


> use Kineto's trace to sum only GPU time.

This sounds good! It would be great if we had a PyTorch utility to benchmark only GPU time. We could converge all kernel benchmarks to use it if possible.

@FindHao (Member) commented Aug 18, 2025

> benchmarking.benchmark_gpu seems to be better for benchmarks than triton.testing.do_bench.

Can you clarify what the differences are between these two implementations?

What I mean is: if you claim that the Dynamo overhead is completely hidden in real runs, I am willing to trust that. However, we still need to verify that the results from benchmark_gpu are consistent with those from do_bench on other handwritten Triton kernels.

Alternatively, measuring only the pure GPU time may be misleading, since it could overlook real Python or Triton runtime overhead.

@yf225 (Contributor, Author) commented Aug 18, 2025

> measuring only the pure GPU time may be misleading, since it could overlook real Python or Triton runtime overhead

Would we be open to having a mode that measures only GPU time? I believe that for real-world model use cases, the kernel launch or Python overhead is usually hidden by the previous kernel (assuming no bubbles), so having a mode that only measures GPU time is useful and reflects that scenario too.

@FindHao (Member) commented Aug 18, 2025

> measuring only the pure GPU time may be misleading, since it could overlook real Python or Triton runtime overhead
>
> Would we be open to having a mode that measures only GPU time? I believe that for real-world model use cases, the kernel launch or Python overhead is usually hidden by the previous kernel (assuming no bubbles), so having a mode that only measures GPU time is useful and reflects that scenario too.

Try --metrics cuda_time.

> `cuda_time`: Sum of all GPU-side kernel time of an operator backend, measured by Kineto and the PyTorch Profiler.

Is this what you need?
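
For reference, a rough sketch of the Kineto-based idea mentioned above (summing only device-side kernel time from a PyTorch profiler trace). This illustrates the general technique, not TritonBench's actual `cuda_time` implementation, and the profiler attribute name may vary across PyTorch versions:

```python
import torch
from torch.profiler import profile, ProfilerActivity

def gpu_time_per_iter_ms(fn, warmup=5, rep=20):
    """Sum device-side kernel time from a profiler (Kineto) trace."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        for _ in range(rep):
            fn()
        torch.cuda.synchronize()
    # self_cuda_time_total is reported in microseconds; newer PyTorch
    # versions may expose it as self_device_time_total instead.
    total_us = sum(evt.self_cuda_time_total for evt in prof.key_averages())
    return total_us / rep / 1000.0
```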

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Aug 19, 2025
Common benchmark suites like TritonBench use `triton.testing.do_bench` for kernel timing measurement, which is not always fair for all backends. E.g. it includes torch.compile Dynamo invocation overhead and hence doesn't reflect real-world model use cases, where Dynamo overhead is usually hidden.

I also opened a PR to use this timing measurement function on the TritonBench side: meta-pytorch/tritonbench#333. But regardless of whether that PR can land, I think we should enhance Inductor benchmark_gpu to match do_bench features, to make it easier for people to migrate.

Pull Request resolved: #160921
Approved by: https://github.com/BoyuanFeng
@xuzhao9 (Contributor) commented Aug 20, 2025

Due to the limitations of Triton's do_bench/do_bench_cudagraph, it is beneficial to have the Inductor benchmarker and more latency measurement modes (e.g. power-limit-aware benchmarking) in the future.

x.grad = None

# Measure only the function execution time
ms_time = benchmarker.benchmark_gpu(fn)
Review comment (Contributor):

I am wondering, does the Inductor benchmarker use CUDA graphs?

@yf225 (Contributor, Author) replied:

I believe it doesn't use CUDA graphs but uses CUDA event timing (e.g. start_event.elapsed_time(end_event)).
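
For illustration, the CUDA-event pattern being referred to looks roughly like this. It is a sketch of the general technique, not the Inductor benchmarker's actual implementation:

```python
import torch

def cuda_event_time_ms(fn, iters=100):
    # Warm up and drain pending work before timing.
    fn()
    torch.cuda.synchronize()
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(iters):
        fn()
    end_event.record()
    torch.cuda.synchronize()
    # elapsed_time returns the milliseconds between the two recorded events.
    return start_event.elapsed_time(end_event) / iters
```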

can-gaa-hou pushed a commit to can-gaa-hou/pytorch that referenced this pull request Aug 22, 2025
@yf225 force-pushed the inductor_benchmarker branch from 37509ff to 1a758b2 on September 2, 2025 19:52
@yf225 merged commit 206b93c into main on Sep 2, 2025 (6 checks passed).
@yf225 deleted the inductor_benchmarker branch on September 2, 2025 20:28.